Abstract: Diffusion models achieve high visual fidelity in image generation, but image editing with them still faces critical challenges: ambiguity in interpreting user intent, insufficient control over local details, and lag in interactive response. To address these issues, a cross-modal interactive image editing method based on bidirectional collaboration between large language models and user interaction (BiC-LLM) is proposed. Its core is a bidirectional collaborative control mechanism that fuses top-down, high-level semantic guidance from a large language model with bottom-up, low-level visual control exercised directly by the user, enhancing the controllability and precision of image editing through semantic enhancement, feature decoupling, and dynamic feedback. First, a hierarchical semantic-driven module is designed: the large language model decouples and reasons over the user's textual input to generate fine-grained semantic vectors that capture user intent precisely. Second, a dynamic control module with vision-structure decoupling is constructed, combining multi-level visual feature extractors with object-level modeling to control global structure and local appearance independently. Finally, a real-time interaction mechanism is introduced that lets users intervene in the editing process through mask annotations and parameter adjustments, supporting iterative refinement. Experiments on the LSUN, CelebA-HQ, and COCO datasets show that BiC-LLM outperforms baseline models in textual consistency, structural stability, and interactive controllability, enables multi-object semantic editing in complex scenes, and preserves the content of unedited regions, demonstrating its effectiveness and robustness in image editing tasks.
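To make the described pipeline concrete, the following is a minimal, runnable Python sketch of the control flow the abstract outlines, not the authors' implementation: the LLM call, the visual feature extractors, and the diffusion backbone are replaced by toy stand-ins, and all names (llm_semantic_decouple, EditState, interactive_edit) as well as the "object: attribute = value" instruction format are illustrative assumptions.

# Hypothetical sketch of the BiC-LLM editing loop described in the abstract.
# Every class and function name here is an illustrative placeholder; the
# LLM, feature extractor, and diffusion backbone are stubbed out so that
# only the bidirectional control flow is shown.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticVector:
    """Fine-grained semantic unit produced by the hierarchical LLM module."""
    target_object: str      # which object the edit applies to
    attribute: str          # e.g. "color", "style", "shape"
    value: str              # desired attribute value

def llm_semantic_decouple(instruction: str) -> List[SemanticVector]:
    """Stand-in for the top-down LLM step: decouple a free-form instruction
    into object-level semantic vectors. A real system would prompt a large
    language model here; this toy parser assumes a demo format
    "object: attribute = value; ..."."""
    vectors = []
    for clause in instruction.split(";"):
        obj, rest = clause.split(":")
        attr, val = rest.split("=")
        vectors.append(SemanticVector(obj.strip(), attr.strip(), val.strip()))
    return vectors

@dataclass
class EditState:
    """Decoupled image state: global structure vs. per-object appearance."""
    structure: dict = field(default_factory=dict)   # object layout / geometry
    appearance: dict = field(default_factory=dict)  # per-object style features

def apply_edit(state: EditState, vec: SemanticVector, mask: set) -> EditState:
    """Stand-in for the vision-structure decoupled control module: only
    objects covered by the user-supplied mask are modified, so non-edited
    regions stay untouched (the consistency property the abstract claims)."""
    if vec.target_object in mask:
        state.appearance[vec.target_object] = {vec.attribute: vec.value}
    return state

def interactive_edit(instruction: str, mask: set, rounds: int = 1) -> EditState:
    """Bidirectional loop: top-down LLM guidance plus bottom-up user control.
    `rounds` models the real-time feedback mechanism, with mask or parameter
    adjustments assumed to happen between iterations."""
    state = EditState()
    for _ in range(rounds):
        for vec in llm_semantic_decouple(instruction):
            state = apply_edit(state, vec, mask)
        # In the full method the user would inspect the intermediate result
        # here and refine the mask or parameters before the next round.
    return state

if __name__ == "__main__":
    result = interactive_edit("sofa: color = red; wall: style = brick",
                              mask={"sofa"})
    print(result.appearance)   # {'sofa': {'color': 'red'}} -- wall untouched

Note the design choice this sketch highlights: the user's mask gates every LLM-derived edit, which is one plausible way the method could guarantee that unedited regions remain unchanged while multiple objects are edited semantically.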
SHI Hui, JIN Conghui. Cross-Modal Interactive Image Editing Based on Bidirectional Collaboration with Large Language Models. Pattern Recognition and Artificial Intelligence, 2025, 38(7): 596-612.